Bruno A. Galindo
bambrozi@uoguelph.ca
bruno@uenp.edu.br
After finish you GWAS analysis you get a list with significant associations between a marker (SNP) and your trait (phenotype), this list with from this list you will need 3 columns for the following analyzes, the SNP name, the chromosome and position of your markers.
In the image below you can find a summary of Steps, input and output information.
https://www.ensembl.org/Homo_sapiens/Tools/VEP
After perform the GWAS analysis we got as output a list of significan SNPs that are associated with our trait.
The very first step to understand what this SNPs could represent can be a VEP analysis. With this analysis get more information about the location of our variants (SNPs generated by mutations) and therefor infer some consequences of them. We are gonna check if our variant falls into:But, first we need to proivde the rsID of our variants.
The rsID is a “stable” ID that doesn’t change along the time. So is a better use than “comercial” SNP Name. We are gonna get this IDs using the website SNPchimp https://webserver.ibba.cnr.it/SNPchimp/index.php/links
Using the tab BROWSE hit the link for your specie.
Now, paste your SNPnames in the search area separated by comma.
Now you have your rsID and you can download this output in CSV or TSV format.
This session is still under construction!
You can find the VEP in the website https://mart.ensembl.org/info/docs/tools/vep/index.html
Now that you have some insights about the location and consequences of your variants we can move forward and discover which genes and QTL are surrounding our significant SNPs. This step we are going to perform using GALLO package.
Many informations from GALLO you can find on (Fonseca et al. 2020)
To use GALLO you need three different input files:GENE postion: a file that contains the position of known GENES on your reference genome.
Acess the website https://www.animalgenome.org/cgi-bin/QTLdb/index and hit the link for your specie
Now download the file, but pay attention to get the .gff format and the correct version for yourHere is some text that reference genome.
Acess the Ensemble FTP download page https://www.ensembl.org/info/data/ftp/index.html?redirect=no to download the .gtf file
Use the seach area to find your specie, and than download the GTF file.
It is also important double check the reference genome from which you are downloading the gene position. This must be the same genome reference version that the .map file you have used. To do this, hit the specie name,
than check on the left top corner the Reference Genome version. If you need to change, use the drop down menu, and be sure to hit the GO buton.
Now check again in the top left corner the genome reference version, and click on Download GTF
And download the gtf file.
Now you have the two files with gene and QTL position for your reference genome.
The last file that you need is the output of your GWAS analysis, the file with the Chromosome, SNP name and SNP position for your significant SNPs.
Now you are gonna need to load the three imput files as in the code below.
Than you are gonna run the function find_genes_qtls_around_markers setting up the correct method choosing GENE or QTL.
You also need to choose the proper interval that is the range that GALLO will check upstream and downstream of you SNP position.
#GALLO
#import a QTL annotation file
qtl_dbb <- import_gff_gtf(db_file="/Animal_QTLdb_release52_goatCHIR_ARS1.gff.gz",file_type="gff")
#import a gene annotation file
gene_dbb <- import_gff_gtf(db_file="/Capra_hircus.ARS1.111.gtf.gz",file_type="gtf")
#import MARKER files = the GWAS output
gwas <- read.table(file = "/SNP_sig_BH.txt",
head=T, stringsAsFactors = F)
# Assuming "gwas" is your dataframe
gwas <- subset(gwas, select = c(Chr, SNP, bp))
colnames(gwas) <- c("CHR","SNP", "BP")
#FINDING GENES AND QTLs ARROUND THE MARKER
#FINDING GENES
out.genes <- find_genes_qtls_around_markers(db_file= gene_dbb,
marker_file= gwas,
method = "gene",
marker = "snp",
interval = 50000,
nThreads = NULL)
write.table(out.genes, file = "/out_genes_50k.txt",
quote = FALSE, sep = "\t", row.names = FALSE, col.names = T)
#FINDING QTLs
out.qtl <- find_genes_qtls_around_markers(db_file= qtl_dbb,
marker_file= gwas,
method = "qtl",
marker = "snp",
interval = 50000,
nThreads = NULL)
write.table(out.qtl, file = "/out_qtl_50k.txt",
quote = FALSE, sep = "\t", row.names = FALSE, col.names = T)
Now, as output you have two files with the Genes and QTL surrounding your SNP into your interval.
GENE
QTL
GALLO have some tools to summarize our QTL data, the first two are charts that give us the proportion of QTL by class for instance.
With this function you are gonna plot the percentage of each QTL class annotated
#GALLO
par(mar=c(1,30,1,1))
plot_qtl_info(out.qtls, qtl_plot = "qtl_type", cex=2)
plotting the percentage of each trait annoatted for a determined class of QTL, for instance Reproduction QTL
#GALLO
par(mar=c(5,20,1,1))
plot_qtl_info(out.qtls, qtl_plot = "qtl_name", qtl_class="Reproduction")
This section is based in (Dessimoz and Škunca 2017)
Now we have our genes and QTLs surrounding our significant SNPs. We can check for the presence of this genes in specific terms, and test the significance of this “presence”.
ps. from now every time I use the word term could be a GO term (GO: Molecular Function, GO: Cellular Component, GO: Biological Process), KEGG Pathway, REACTOME Pathway and other terms.
Imagine we have the scenario below:
In simple terms we want to know if we have a bigger proportion of annotated genes to a specific term in our sample than in the whole population (chromosome or genome). For instance, is the proportion of genes present in the pathway grater in our sample than in the genome/chromosome? Our H_0 is what I observed in my sample, is this what we would expect just by chance?
To estimate the probability of this scenario we can use the following formula.
(Source: (Dessimoz and Škunca 2017))
With this formula I’ll calculate the probability of our scenario.
But we need to check its significance and to do this we can estimate a p-value that consist in adding two probabilities under the assumption that the null hypothesis is true: 1) a scenario as extreme as our 2) scenarios more extreme than, the one actually observed.
(Modified from: (Dessimoz and Škunca
2017))
In our example we found that we have 0.044 (less than 0.05 or 5%) that our scenario or more extreme scenarios happen just by chance. so we reject H0.
So we can say that our term (pathway or GO term) is enriched (more frequent / present) in our sample than in the whole genome/chromosome.
For more detailed explanation about p-value calculation you can watch the videos below:
As we can test for several terms (pathways) we need to control for multiple testing because, otherwise we can get many false positive just by chance.
Another way to interpret p-value is the number of false positive we “accept”, for example for a p-value of 5% we assume that we can have 5% of false positive among our significant test. Imagine a situation when you perform 400 p-values calculation, if you assume a 0,05 threshold 400 x 0,05 = 20 false positives.
To control false positive in situations of multiple testing we can use for example: Bonferroni or FDR.
In the case of FDR we will correct the p-values for the number of tests, and as consequence usualy we are going to “increase” the p-value, so after this the terms that remain positive we can consider for our next steps as significant enriched terms.
You can find extra information about FDR in this youtube video https://youtu.be/K8LQSvtjcEo?si=rEc6bRkOt0luJhiW
The theory behind the QTL enrichment analysis is pretty much the same that the Gene Enrichment, but instead of testing for presence in a GO term we are going to test for a specific trait, look the scenario bellow, we tried to correlate the image with the QTL enrichment output from GALLO.
The probability formula is:
\[ P = \frac{{\binom{mt}{nt} \cdot \binom{m-mt}{n-nt}}}{{\binom{m}{n}}} \]
Using our example:
\[ P = \frac{{\binom{27}{2} \cdot \binom{2967}{350}}}{{\binom{2994}{352}}} \]
#R Example
chr_ann <- 27
sample_ann <- 2
chr_qtl <- 2994
sample_qtl <- 352
log_numerator <- lchoose(chr_ann, sample_ann) + lchoose((chr_qtl - chr_ann), (sample_qtl - sample_ann))
log_denominator <- lchoose(chr_qtl, sample_qtl)
log_P <- log_numerator - log_denominator
P <- exp(log_P)
But to estimate the p-value you need to sum up all probabilities equal or rarer, in other words I need to calculate the probabilities in the same way that in the previour formula for all situations bellow and then sum up:
2 QTLs out of 27 QTLs (my sample)
3 QTLs out of 27 QTLs
4 QTLs out of 27 QTLs
5 QTLs out of 27 QTLs
.
.
.
27 QTLs out of 27 QTLs
\[ P = \sum_{i=2}^{27} \frac{{\binom{mt}{nt_i} \cdot \binom{m-mt}{n-nt}}}{{\binom{m}{n}}} \]
The output of the formula above will be the p-value and can also be calculated in r using the code below.
#R example
chr_ann <- 27 # These are the the QTLs for a specific trait annotated in a specific chromosome (mt)
chr_qtl <- 2994 # These are ALL QTLs (regardless the trait) annotated for a specific chromosome (Population size = m)
sample_qtl <- 352 # These are the SAMPLE QTLs (regardless the trait) annotated for your sample at a specific chromosome (Sample size = n)
log_denominator <- lchoose(chr_qtl, sample_qtl)
# Initialize a vector to store the results
probabilities <- numeric(26) # We'll have 26 new combinations from 2 to 27
# Loop through different sample sizes from 2 to 27 --> these are the QTLs for a specific trait annotated in a specific chromosome for my SAMPLE (nt)
for (sample_ann in 2:27) {
# Calculate the logarithm of the numerator
log_numerator <- lchoose(chr_ann, sample_ann) + lchoose((chr_qtl - chr_ann), (sample_qtl - sample_ann))
# Calculate the logarithm of P
log_P <- log_numerator - log_denominator
# Calculate P
P <- exp(log_P)
# Store the result
probabilities[sample_ann - 1] <- P
}
# Print the probabilities
print(probabilities)
# Calculate the sum of probabilities
total_probability <- sum(probabilities)
total_probability
#GALLO
#QTL enrichment analysis
out.enrich<-qtl_enrich(qtl_db= qtl_dbb,
qtl_file= out.qtl, qtl_type = "Name",
enrich_type = "chromosome", chr.subset = NULL,
padj = "fdr",nThreads = 2)
The output has this columns
pval: P-value to be used in the plot. If “p_value” informed, a non-adjusted pvalue will be plotted. If “p.adj” informed, the adjusted p-value from the qtl enrichment analysis will be plotted.
Before plot the enrichment results, a new ID column will be created in order to make easier to identify the enrichment results per chromosome. Additonally, we are going to match the QTL classes for each trait and filter the top 10 enriched QTLs.
Creating a new ID composed by the trait and the chromosome
#GALLO
out.enrich$ID<-paste(out.enrich$QTL," - ","CHR",out.enrich$CHR,sep="")
Match the QTL classes and filtering the Reproduction related QTLs
#GALLO
out.enrich.filtered<-out.enrich[which(out.enrich$adj.pval<0.05),]
Plotting the enrichment results for the QTL enrichment analysis
#GALLO
QTLenrich_plot(out.enrich.filtered, x="ID", pval="adj.pval")
To perform the Gene enrichment analysis we are goint to use the application GPROFILER, you can use GPROFILER either in R or online.
### enriquecimento genico
#install.packages("gprofiler2")
library(gprofiler2)
#Para conferir a lista de organism -> https://biit.cs.ut.ee/gprofiler/page/organism-list
#Obs: eu entro com os ids ENSOAR...
query <- read.table ("/out_genes_60k.txt", header = T)
query <- query[,c("gene_id")]
gene_enrich <- gost(
query,
organism = "chircus", #o de cabra é "chircus"
ordered_query = FALSE,
multi_query = FALSE,
significant = TRUE,
exclude_iea = FALSE,
measure_underrepresentation = FALSE,
evcodes = TRUE,
user_threshold = 0.10,
correction_method = c("fdr"),
domain_scope = c("annotated"),
numeric_ns = "",
sources = NULL,
as_short_link = FALSE,
highlight = FALSE
)
#str(gene_enrich) # para ver o formato dos meus dados
#selecionando apenas as informações da lista que me interessam para fazer meu data.frame
result_enrich <- data.frame(gene_enrich$result)
result_enrich <- data.frame(Category = result_enrich$source,
ID = result_enrich$term_id,
Term = result_enrich$term_name,
adj_pvalue = result_enrich$p_value,
id_ensembl = result_enrich$intersection)
write.table(result_enrich,"/home/bambrozi/Sec_ARS1_GrassHill_1/GPROFILER/gene_enrich.txt", col.names=TRUE, row.names=FALSE, sep="\t", quote=F)
gostplot(
gene_enrich,
capped = TRUE,
interactive = T,
pal = c(`GO:MF` = "#dc3912", `GO:BP` = "#ff9900", `GO:CC` = "#109618", KEGG =
"#dd4477", REAC = "#3366cc", WP = "#0099c6", TF = "#5574a6", MIRNA = "#22aa99", HPA =
"#6633cc", CORUM = "#66aa00", HP = "#990099")
)
#to use this function to save, it is necesssary choose interactive =F in the previous step
#to use this function create a object "publish_gostplot" in the previous step.
publish_gostplot(
gostplot_obj,
highlight_terms = NULL,
filename = NULL,
width = NA,
height = NA
)
#CREATE A TABLE
publish_gosttable(
gene_enrich,
highlight_terms = NULL,
use_colors = TRUE,
show_columns = c("source", "term_name", "term_size", "intersection_size"),
filename = "result_table.pdf",
ggplot = TRUE
)
https://string-db.org/ This session is still under construction!